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A  HIGH  PERFORMANCE  COMPUTING  APPROACH  TO  THE 
SIMULATION  OF  FLUID-SOLID  INTERACTION  PROBLEMS  WITH 
RIGID  AND  FLEXIBLE  COMPONENTS 


This  work  outlines  a  unified  multi-threaded,  multi-scale  High  Performance 
Computing  (HPC)  approach  for  the  direct  numerical  simulation  of  Fluid-Solid  Inter¬ 
action  (FSI)  problems.  The  simulation  algorithm  relies  on  the  extended  Smoothed 
Particle  Hydrodynamics  (XSPH)  method,  which  approaches  the  fluid  flow  in  a  La- 
grangian  framework  consistent  with  the  Lagrangian  tracking  of  the  solid  phase. 
A  general  3D  rigid  body  dynamics  and  an  Absolute  Nodal  Coordinate  Formulation 
(ANCF)  are  implemented  to  model  rigid  and  flexible  multibody  dynamics.  The  two- 
way  coupling  of  the  fluid  and  solid  phases  is  supported  through  use  of  Boundary 
Condition  Enforcing  (BCE)  markers  that  capture  the  fluid-solid  coupling  forces  by 
enforcing  a  no-slip  boundary  condition.  The  solid-solid  short  range  interaction,  which 
has  a  crucial  impact  on  the  small-scale  behavior  of  fluid-solid  mixtures,  is  resolved 
via  a  lubrication  force  model.  The  collective  system  states  are  integrated  in  time  using 
an  explicit,  multi-rate  scheme.  To  alleviate  the  heavy  computational  load,  the  overall 
algorithm  leverages  parallel  computing  on  Graphics  Processing  Unit  (GPU)  cards. 
Performance  and  scaling  analysis  are  provided  for  simulations  scenarios  involving 
one  or  multiple  phases  with  up  to  tens  of  thousands  of  solid  objects.  The  software 
implementation  of  the  approach,  called  Chrono::Fluid,  is  part  of  the  Chrono  project 
and  available  as  an  open-source  software. 


1.  Introduction 

The  last  decade  witnessed  a  paradigm  shift  in  the  microprocessor  industry 
towards  chip  designs  that  provide  strong  support  for  parallel  computing.  Ter- 
aflop  computing,  i.e.  computing  at  the  rate  of  1012  FLoating-point  Operation 
Per  Second,  until  recently  the  privilege  of  a  select  group  of  large  research 
centers,  is  becoming  a  commodity  due  to  inexpensive  GPU  cards  and  multi- 
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to  many-core  x86  processors.  The  steady  improvements  in  processor  speed 
and  architecture,  memory  layout  and  capacity,  and  power  efficiency  have 
motivated  a  trend  of  re-evaluating  simulation  algorithms  with  an  eye  towards 
identifying  solutions  that  map  well  onto  these  new  hardware  platforms.  In  this 
context,  a  scrutinizing  of  the  existing  Fluid-Solid  Interaction  (FSI)  solutions 
reveals  that  they  are  mostly  inadequate  since  they  either  rely  on  algorithms 
that  do  not  map  well  on  emerging  computer  architectures,  or  they  are  unable 
to  capture  phenomena  of  interest  which  may  require  the  simulation  of  tens 
of  thousands  of  rigid  and  deformable  bodies  that  interact  directly  or  through 
the  fluid  media. 

FSI  problems  are  usually  simulated  in  either  an  Eulerian-Lagrangian 
framework,  where  the  Lagrangian  solid  phase  moves  with/over  the  Eulerian 
grid  used  for  fluid  discretization,  or  an  Eulerian-Eulerian  framework,  where 
the  solid  phase  is  considered  as  an  average  of  ensembles  instead  of  a  specific 
state.  While  the  former  approach  delivers  the  flexibility  required  by  general 
multibody  dynamics  problems,  it  does  not  provide  a  level  of  performance 
suitable  for  the  simulation  of  dense  fluid-solid  problems  such  as  suspensions. 
This  can  be  traced  back  to  the  costly  processes  that  overlaps  the  two  different 
representations  of  the  fluid  and  solid  phases. 

The  Lagrangian  representation  of  the  fluid  flow  dovetails  well  with  the 
Lagrangian  approach  used  for  the  motion  of  the  solid  phase.  This  approach 
has  recently  been  employed  in  the  context  of  multibody  dynamics  and  particle 
suspension  in  [12,  25,  29-31,  331.  Specifically,  the  FSI  methodology  relying 
on  Smoothed  Particle  Hydrodynamics  (SPH)  [16,  19,  22]  and  rigid  body 
Newton-Euler  equation  of  motion  [10]  was  discussed  in  detail  in  [29,  301. 
This  or  a  similar  framework  was  used  for  the  investigation  and  validation 
of  rigid  body  suspensions  [30],  as  well  as  flexible  beams  interacting  with 
fluid  [12,  31,  33].  Additionally,  support  for  the  short-range  solid-solid  inter¬ 
action  was  introduced  in  [30,  31],  which  was  leveraged  in  the  simulation  of 
dense  suspensions. 

This  contribution  concentrates  on  the  high  performance  computing  ap¬ 
proach  required  for  the  efficient  simulation  of  FSI  problems.  Although  the 
approach  and  algorithms  explained  herein  can  leverage  any  multi-threaded 
architecture,  the  CUDA  library  [26]  was  employed  for  the  execution  of  all 
solution  components  on  the  GPU,  with  negligible  data  transfer  between  the 
host  (CPU)  and  the  device  (GPU).  The  paper  is  organized  as  follows.  The 
numerical  methods  adopted  for  the  simulation  of  fluid  and  solid  phases, 
fluid-solid  coupling,  and  solid-solid  short  range  interaction  are  explained  in 
Sect.  2.  The  computational  scheme,  including  the  time  integration  algorithm, 
HPC -based  implementation,  and  parallel  neighbor  search  required  for  SPH 
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are  described  in  Sect.  3.  Section  4  contains  a  performance  analysis  of  the 
algorithm  using  Kepler-type  GPU  cards. 

2.  Numerical  approach 

The  proposed  FSI  simulation  framework  combines  SPH  for  the  simula¬ 
tion  of  the  fluid  flow,  the  Newton-Euler  formalism  for  rigid  body  dynamics, 
and  ANCF  for  modeling  flexible  body  dynamics.  These  algorithmic  compo¬ 
nents  are  described  in  more  detail  in  the  following  subsections,  including  a 
discussion  on  the  approach  adopted  for  resolving  fluid-solid  and  solid-solid 
interactions. 


2.1.  Smoothed  Particle  Hydrodynamics 


The  term  smoothed  in  SPH  refers  to  the  approximation  of  point  proper¬ 
ties  via  a  smoothing  kernel  function  W,  defined  over  a  support  domain  S. 
This  approximation  reproduces  functions  with  up  to  2nd  order  of  accuracy, 
provided  the  kernel  function:  (1)  approaches  the  Dirac  delta  function  as  the 

size  of  the  support  domain  tends  to  zero,  that  is  lim  W(r,  h)  =  <5(r),  where  r 

h — >0 

is  the  spatial  distance  and  h  is  a  characteristic  length  that  defines  the  kernel 
smoothness;  (2)  is  symmetric,  i.e.,  W(r,h)  =  W(-r,h);  and  (3)  is  normal, 
i.e.,  Js  W(r,  h)dV  =  1,  where  dV  denote  the  differential  volume.  A  typical 
spatial  function  /(x)  is  then  approximated  by  </(x))  as 


/(x)  =  J /(x')d(x  -  x')dV 

s 

=  J  f(x')W(x  -  x!,h) dV  +  0(h2) 


{ fix ))  +  0{h2)  . 


(1) 


To  simplify  notation,  in  the  remainder  of  this  document  we  use  /(x)  to  repre¬ 
sent  (/(x)).  Using  Eq.  (1)  and  the  divergence  theorem,  the  spatial  derivatives 
of  a  function  can  be  mapped  to  the  derivatives  of  the  kernel  function.  For 
instance,  the  gradient  of  a  function  can  be  written  as 


V/(x)  =  f(x')W{x  -  x\  /?.)-ndA 


dS 


x')VVF(x  -  x',  h)dV  , 


5 


(2) 
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where  dS  is  the  boundary  of  the  integration  volume  V  and  dA  is  the  dif¬ 
ferential  area.  By  imposing  an  additional  property  for  the  kernel  function, 
namely  that  (4)  it  approaches  zero  as  r  increases,  i.e.,  lim  W(r,h)  =  0, 

r— >oo 

the  first  term  on  the  right  hand  side  of  Eq.  (2)  vanishes.  Note  that  some 
additional  considerations,  which  will  be  addressed  later,  are  required  for  the 
SPH  approximation  near  boundaries. 

The  term  particle  in  SPH  terminology  indicates  the  discretization  of  the 
domain  by  a  set  of  Lagrangian  particles.  To  remove  the  ambiguity  caused  by 
the  use  of  the  term  rigid  particles  in  the  context  of  FSI  problems,  we  use  the 
term  marker  to  refer  to  the  SPH  discretization  unit.  Each  marker  has  mass 
m  associated  to  the  representative  volume  dV  and  carries  all  of  the  essential 
field  properties.  As  a  result,  any  field  property  at  a  certain  location  is  shared 
and  represented  by  the  markers  in  the  vicinity  of  that  location.  The  value  of 
a  certain  unknown  at  a  given  location  is  calculated  according  to  the  distance 
between  that  location  and  the  collection  of  markers  in  its  vicinity  using  the 
field  values  at  these  markers  and  the  expression  of  the  kernel  function  W. 
This  leads  for  the  second  approximation  embedded  in  SPH,  which  can  be 
expressed  as 

/(x)  =  J'^^W(x-x',h)p(x')dV 

(3) 


Y  —f(xb)W(x-xb,h) 
^  Pb 


where  b  is  the  marker  index  and  pb  is  the  fluid  density,  smoothed  at  the 
marker  location  xh.  The  summation  in  Eq.  (3)  is  over  all  markers  whose 
support  domain  overlaps  the  location  x.  Several  other  properties  of  the  kernel 
functions  are  provided  in  [16].  For  instance,  the  kernel  function  should  be  a 
positive  and  monotonically  decreasing  function  of  r,  which  implies  that  the 
influence  of  distant  markers  on  field  properties  at  a  given  location  is  less  than 
that  of  nearby  markers.  Moreover,  for  acceptable  computational  performance 
and  to  avoid  a  quadratic  computational  complexity,  kernel  functions  have  a 
compact  domain  of  influence  with  a  radius  Rs  defined  as  some  finite  multiple 
k  of  the  characteristic  length  h,  as  shown  in  Figure  1.  The  methodology 
adopted  herein  relies  on  a  cubic  spline, 


(2  -  qf  -  4(1  -  qf,  0<q<l 


W(q,h)  = 


4: r/?3 


Xj(2- qf, 

.0, 


1  <  q  <  2  , 
q  >2 


(4) 


where  q  =  \r\/h,  has  a  support  domain  with  radius  2 h. 
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Fig.  1.  Illustration  of  the  kernel  W  and  its  support  domain  S.  SPH  markers  are  shown  as  black 
dots.  For  2D  problems  the  support  domain  is  a  circle,  while  for  3D  problems  it  is  a  sphere 


2.1.1.  SPH  for  fluid  dynamics 


For  fluid  dynamics,  the  continuity  and  momentum  equations  are  given 
in  Lagrangian  form  as 

di-^ 


and 


dx  1 

—  =  — V p  +  -V“v  +  f  , 
at  p  p 


(6) 


where  p  is  the  fluid  viscosity,  while  v  and  p  are  the  flow  velocity  and  pressure, 
respectively.  Within  the  SPH  framework  described  earlier,  Eqs.  (5)  and  (6) 
are  discretized  at  an  arbitrary  location  x  =  xa  within  the  fluid  domain  as  [241 


In  the  above  equations,  quantities  with  subscripts  a  and  b  are  associated 
with  markers  a  and  b  (see  Figure  1),  respectively.  It  is  important  to  note  that 
these  quantities  are  different  from  the  corresponding  physical  quantities  at 
locations  xa  and  xb.  The  viscosity  term  Y[ab  is  defined  as 


if^a  Pi)  )X(//; '  W,,,, 

PlMlb  + 


V ab  ? 


(9) 


where  xab  =  xa  -  xb,  Wab  =  W(xab,  h),  Va  is  the  gradient  with  respect  to  x,(, 
i.e.,  d/dxa,  and  e  is  a  regularization  coefficient.  Quantities  with  an  over-bar 
are  averages  of  the  corresponding  quantities  for  markers  a  and  b. 
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Alternative  viscosity  discretizations  include: 

1.  the  model  suggested  in  [5]: 

Id ^  —  ~PaPb^ab'^ab/\PaPbiPa  "t"  Pb)(Xajf  ^  sh~a^)\^ ctW ab  > 

2.  direct  discretization  of  V2  operator  [22,  241;  and 

3.  the  class  of  artificial  viscosity  models  introduced  in  [17,  19,  23] 

However,  using  Eq.  9  is  preferred  since  it  has  the  following  properties: 
(1)  it  ensures  that  the  viscous  force  is  along  the  shear  direction  \ab,  instead 
of  the  particles  center  line  xab,  (2)  it  is  less  sensitive  to  local  velocities; 
(3)  it  allows  for  better  computational  efficiency  by  removing  the  nested  loop 
required  for  the  computation  of  the  V2  operator;  and  (4)  it  is  stated  in  terms 
of  physical  properties,  rather  than  model  parameters  like  those  in  artificial 
viscosity,  which  are  introduced  primarily  for  numerical  stabilization  through 
damping.  In  the  simulation  of  transient  Poiseuille  flow,  although  we  man¬ 
aged  to  virtually  obtain  exact  results  using  Eq.  9  [30],  the  error  caused  by 
implementing  either  Yi*ab  or  artificial  viscosity  were  non-negligible. 

In  the  weakly  compressible  SPH  model,  the  pressure  p  is  evaluated  using 
an  equation  of  state  [22] 


P  = 


cfpo 

y 


p_ 

Po 


(10) 


where  po  is  the  reference  density  of  the  fluid,  y  tunes  the  stiffness  of  the 
pressure-density  relationship,  and  cs  is  the  speed  of  sound.  The  value  cs  is 
adjusted  depending  on  the  maximum  speed  of  the  flow,  Vmax,  to  keep  the  flow 
compressibility  below  some  arbitrary  value.  We  use  y  =  7  and  cs  =  10Pmax, 
which  allows  1%  flow  compressibility  [22], 

The  fluid  flow  equations  (7)  and  (8)  are  solved  together  with 


dxa 

=  —r~  =  V, 

dt 


(ID 


to  update  the  position  of  the  SPH  markers. 

The  original  SPH  summation  formula  calculates  the  density  according 
to 

Pa  =  ^mb  Wab-  (12) 

b 

Equation  (7),  which  evaluates  the  time  derivative  of  the  density,  was  preferred 
to  the  above  since  it  produces  a  smooth  density  field,  works  well  for  markers 
close  to  the  boundaries  (the  free  surface,  solid,  and  wall),  and  does  not  exhibit 
the  large  variations  in  the  density  field  introduced  when  using  Eq.  (12)  close 
to  the  boundaries.  However,  Eq.  (7)  does  not  guarantee  consistency  between 
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a  marker’s  density  and  the  associated  mass  and  volume  [3,  21,  24].  The 
so-called  density  re-initialization  technique  [6]  attempts  to  address  this  issue 
by  using  Eq.  (7)  at  each  time  step  and  periodically;  i.e.,  every  n  time  steps, 
use  Eq.  (12)  to  correct  the  mass-density  inconsistency.  The  results  reported 
herein  were  obtained  with  n  =  10.  The  Moving  Least  Squares  method  or 
a  normalized  version  of  Eq.  (12)  could  alternatively  be  used  to  address  the 
aforementioned  issues,  see  [6,  7]. 

Finally,  the  methodology  proposed  employs  the  extended  SPH  approach 
(XSPH),  which  prevents  extensive  overlap  of  markers’  support  domain  and 
enhances  flow  incompressibility  [20].  This  correction  takes  into  account  the 
velocity  of  neighboring  markers  through  a  mean  velocity  evaluated  within 
the  support  of  a  nominal  marker  a  as 

\a  =  \a  +  A\a,  (13) 

where 

Ava  =  V  —(v*  -  va)Wab,  (14) 

bPab 

and  0  <  £  <  1  adjusts  the  contribution  of  the  neighbors’  velocities.  All  the 
results  reported  herein  were  obtained  with  £  =  0.5.  The  modified  velocity 
calculated  from  Eq.  (13)  replaces  the  original  velocity  in  the  density  and 
position  update  equations,  but  not  in  the  momentum  equation  [6]. 

2.2.  Rigid  body  dynamics 

Once  the  fluid-solid  interactions  between  individual  markers;  i.e.,  the 
right  hand  side  of  Eqs.  (7)  and  (8),  are  accounted  for,  the  total  rigid  body 
force  and  torque  due  to  the  interaction  with  the  fluid  can  be  obtained  by 
respectively  summing  the  individual  forces  and  their  induced  torques  over  the 
entire  rigid  body.  They  are  then  added  to  any  other  forces,  including  external 
and  contact  forces.  The  dynamics  of  the  rigid  bodies  is  fully  characterized 
by  the  Newton-Euler  equations  of  motion  (EOM),  see  for  instance  [10]: 


sj  ^ 

II 

(15) 

dJk- v. 

dt 

(16) 

--  j;-1  (t;  -  m'J'm')  , 

(17) 

17  =  2°'  • 

(18) 
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and 

qfq,-l=0,  (19) 

where  F„  T',  X(,  V,,  <u'  €  M3  denote  the  force,  torque,  position,  velocity, 
and  angular  velocity  associated  to  body  i,  i  =  1,2, .. . ,  nh,  respectively.  The 
quantity  q,  denotes  the  rotation  quaternion,  while  A4,  and  J'  are  the  mass 
and  moment  of  inertia,  respectively.  Quantities  with  a  prime  symbol  are 

represented  in  the  rigid  body  local  reference  frame.  Given  a  =  [a*,  ay,  a,l  € 

6  M3  and  q  =  [qx,  qy,  qz,  qw j  6  M4,  the  auxiliary  matrices  a  and  G  are 
defined  as  (101 


0 

~az 

ay 

cly 

Qw 

~Qz 

a  = 

az 

0 

~Clx 

and  G  = 

~clw 

Qx 

Qy 

~Cly 

0 

—Qw 

<h 

-qy 

Qx 

2.3.  Flexible  body  dynamics 

The  ANCF  formulation  [361,  which  allows  for  large  deformations  and 
large  rigid  body  rotations,  is  adopted  herein  for  the  simulation  of  flexible 
bodies  suspended  in  the  fluid.  While  extension  to  other  elastic  elements 
is  straightforward,  the  current  Chrono::  Fluid  implementation  only  supports 
gradient  deficient  ANCF  beam  elements,  which  are  used  to  model  slender 
flexible  bodies  composed  of  ne  adjacent  ANCF  beam  elements.  The  flexible 
bodies  are  modeled  using  a  number  nn  =  ne  +  1  of  equally-spaced  node  beam 
elements,  each  represented  by  6  coordinates,  ey  =  [r^ ,  r^v]r,  j  =  0, 1, . . . ,  ne; 
i.e.,  the  three  components  of  the  global  position  vector  of  the  node,  and  the 
three  components  of  its  slope.  This  is  therefore  equivalent  to  a  model  using 
ne  ANCF  beam  elements  with  6  x  n„  continuity  constraints,  but  is  more 
efficient  in  that  it  uses  a  minimal  set  of  coordinates.  We  note  that  formula¬ 
tions  using  gradient  deficient  ANCF  beam  elements  display  no  shear  locking 
problems  [9,  34,  35]  and,  due  to  the  reduced  number  of  nodal  coordinates, 
are  more  efficient  than  fully  parametrized  ANCF  elements.  However,  gradient 
deficient  ANCF  beam  elements  cannot  describe  a  rotation  about  its  axis  and 
therefore  cannot  model  torsional  effects. 

Consider  first  a  single  ANCF  beam  element  of  length  t.  The  global 
position  vector  of  an  arbitrary  point  on  the  beam  center  line,  specified  through 
its  element  spatial  coordinate  0  <  x  <  f,  is  then  obtained  as 

r(x,  e)  =  S(x)e ,  (21) 
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where  e  =  [ej ,  eJ]T  e  M12  is  the  vector  of  element  nodal  coordinates. 
With  I  being  the  3x3  identity  matrix,  the  shape  function  matrix  S  = 
=  [Sil  Sol  S3 1  S4I]  €  M3x12  is  defined  using  the  shape  functions  [36] 

Si  =  l -3f-  +  2£3 

s-. 2  =  e(t-2e+e) 

S: 3  =  3^2  -  2£3 
54  =  €(-^2+^3), 

where  £  =  x/€  €  [0, 11. 

The  element  EOM  are  then  written  as 

Me  +  Q'  =  , 

where  Qe  and  Q"  are  the  generalized  element  elastic  and  applied  forces, 
respectively,  and  M  €  M 1 2x  1 2  is  the  symmetric  consistent  element  mass  matrix 
defined  as 

M  =  JpsASrSdx.  (24) 

The  generalized  element  elastic  forces  are  obtained  from  the  strain  energy 
expression  [36]  as 

Qe  =  J'eAsu  (-^t)  dx  +  £ElK^j  dx  ’ 

where  en  =  (r2  rv  -  I  j  /2  is  the  axial  strain  and  k  =  ||rx  x  rxx||/||rx||3  is  the 
magnitude  of  the  curvature  vector.  The  required  derivatives  of  the  position 
vector  r  can  be  easily  obtained  from  Eq.  (21)  in  terms  of  the  derivatives  of 
the  shape  functions  as  rx(jc,e)  =  Sx(jc)e  and  rxx(x,  e)  =  Sxx(x)e. 

External  applied  forces,  in  particular  the  forces  due  to  the  interaction  with 
the  fluid,  see  Sect.  2.4,  are  included  as  concentrated  forces  at  a  BCE  marker. 
The  corresponding  generalized  forces  are  obtained  from  the  expression  of 
the  virtual  work  as 

Qa  =  Sr(x„)F ,  (26) 

where  F  is  the  external  point  force  and  the  shape  function  matrix  is  evaluated 
at  the  projection  onto  the  element’s  center  line  of  the  force  application  point. 
The  generalized  gravitational  force  can  be  computed  as 

Qs  =  J^psASTgdx  .  (27) 


(22) 


(23) 
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In  the  above  expressions,  ps  represents  the  element  mass  density,  A  is  the 
cross  section  area,  E  is  the  modulus  of  elasticity,  and  I  is  the  second  moment 
of  area. 

The  EOM  for  a  slender  flexible  body  composed  of  ne  ANCF  beam  ele¬ 
ments  are  obtained  by  assembling  the  elemental  EOMs  of  Eq.  (23)  and  taking 
into  consideration  that  adjacent  beam  elements  share  6  nodal  coordinates. 

Let  e  =  |ej ,  c[  ,  ...  e’^  | 7  be  the  set  of  independent  nodal  coordinates; 
then  the  nodal  coordinates  for  the  j-th  element  can  be  written  using  the 
mapping 


B,e,  with  B, 


I3 

...0 


0 

I3 


...0 

...0 


(28) 


and  the  assembled  EOMs  are  obtained,  from  the  principle  of  virtual  work, 
as  follows.  Denoting  by  My  be  the  element  mass  matrix  of  Eq.  (24)  for  the 
j-th  ANCF  beam  element,  it  can  be  written  in  block  form  as 


M, 


My,//  My,/,. 

My,,./  M 


(29) 


where  My,/,.  =  M T.yl  and  all  sub-blocks  have  dimension  6x6.  Here,  1  denotes 
the  left  end  of  the  beam  element,  i.e.,  the  node  characterized  by  the  nodal 
coordinates  e7_i ,  while  r  corresponds  to  the  node  with  coordinates  e7.  With 
a  similar  decomposition  of  a  generalized  element  force  into 


Q/  = 


Qa 

Qy> 


one  obtains 

M  =  Qa  -  Qe 

where 


M 


Mu, 

Mi,,./ 


Mi,/r 

Ml,rr  +  M2 ,// 
M2,r/ 


M 


ne,rr 


2Qu 

Qh 

XQl  +  XQh 

Ql  +  Q  2,/ 

Qa  -  Qe  = 

ZQl+ZQb 

- 

Q 1,  +  Qh 

a  e 

■■  a 

-  Qhr  - 

(30) 

(31) 


(32) 


(33) 
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Inclusion  of  additional  kinematic  constraints,  e.g.,  anchoring  the  beam  at  one 
end  to  obtain  a  flexible  cantilever  or  fixing  its  position  to  obtain  a  flexible  pen¬ 
dulum,  can  be  done  either  by  formulating  the  EOM  as  differential-algebraic 
equations  or  by  deriving  an  underlying  ODE  after  explicitly  eliminating  the 
corresponding  constrained  nodal  coordinates.  The  latter  approach  was  used 
in  all  simulations  involving  flexible  cantilevers  discussed  in  Sect.  4. 

2.4.  Fluid-solid  interaction 

The  two-way  fluid-solid  coupling  was  implemented  based  on  a  method¬ 
ology  described  in  [29].  The  state  update  of  any  SPH  marker  relies  on  the 
properties  of  its  neighbors  and  resolves  shear  as  well  as  normal  inter-marker 
forces.  For  the  SPH  markers  close  to  solid  surfaces,  the  SPH  summations 
presented  in  Eqs.  (7),  (8),  (12),  and  (14)  capture  the  contribution  of  fluid 
markers.  The  contribution  of  solid  objects  is  calculated  via  Boundary  Con¬ 
dition  Enforcing  (BCE)  markers  placed  on  and  close  to  the  solid  surface 
as  shown  in  Figure  2.  The  velocity  of  a  BCE  marker  is  obtained  from  the 
rigid/deformable  body  motion  of  the  solid  thus  ensuring  the  no-slip  condition 
on  the  solid  surface.  Including  BCE  markers  in  the  SPH  summation  equations 
(7)  and  (8)  results  in  fluid-solid  interaction  forces  that  are  added  to  both  fluid 
and  solid  markers. 


Fig.  2.  Fluid-solid  interaction  using  BCE  markers  attached  to  a  rigid  body.  BCE  and  fluid 
markers  are  represented  by  black  and  white  circles,  respectively.  The  BCE  markers  positioned  in 
the  interior  of  the  body  (markers  g  and  /  in  the  figure)  should  be  placed  to  a  depth  less  than  or 
equal  to  the  size  of  the  compact  support  associated  with  the  kernel  function  W.  A  similar  concept 
yet  with  different  kinematics  is  used  for  flexible  bodies 


2.5.  Solid-solid  short  range  interaction 

Dry  friction  models  typically  used  to  characterize  the  dynamics  of  gran¬ 
ular  materials  [1,  13  14]  do  not  capture  accurately  the  impact  of  solid  sur¬ 
faces  in  hydrodynamics  media.  In  practice,  it  is  infeasible  to  resolve  the 
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short-range,  high-intensity  impact  forces  arising  in  wet  media  due  to  com¬ 
putational  limits  on  space  and  time  resolution.  Ladd  [15]  proposed  a  normal 
lubrication  force  between  two  spheres  that  increases  rapidly  as  the  distance 
between  particles  approaches  zero  and  therefore  prevents  the  actual  touching 
of  the  spheres: 


F'f  =  min 


(34) 


where,  a,  and  cij  are  the  sphere  radii,  v„  is  the  normal  component  of  the 
relative  velocity,  and  a  is  the  distance  between  surfaces.  For  a  >  Ac,  F^'/;  =  0, 
and  the  spheres  are  subject  only  to  hydrodynamic  forces. 

Equation  (34)  provides  a  basic  model  for  the  estimation  of  the  lubrication 
force  in  normal  direction.  The  generalization  of  this  model  to  non-spherical 
objects  requires  the  calculation  of  the  minimum  distance  and  the  curvature 
of  the  two  contact  surfaces.  The  calculation  of  the  partial  lubrication  force 
between  non-spherical  surfaces  follows  the  approach  proposed  in  [8]  for  a 
lattice  Boltzmann  method  but  is  amended  to  lit  the  Lagrangian  formulation 
adopted  herein.  Accordingly,  the  force  model  provided  in  Eq.  (34)  is  modified 
as 

F“  =  £f‘, 

‘  ,  3  n  n  ,  <35) 

with  f *  =  min  ^ ,  0|  ■  V*  j  , 

where  5*  and  v*  denote  the  markers’  relative  distance  and  velocity,  respec¬ 
tively,  and  the  summation  is  over  all  interacting  markers  of  the  two  solid 
objects. 


3.  HPC  implementation 

Chrono::Fluid  [4],  an  open-source  simulation  framework  for  fluid-solid 
interaction,  provides  an  implementation  that  executes  all  phases  of  the  so¬ 
lution  method  on  the  GPU.  A  brief  overview  of  the  GPU  hardware  and 
programming  model  adopted  is  followed  below  by  a  detailed  discussion  of 
the  critical  kernels  that  implement  the  proposed  modeling  and  solution  ap¬ 
proach. 


3.1.  GPU  hardware  and  programming  model 

To  a  very  large  extent,  the  performance  of  today’s  simulation  engines  is 
dictated  by  the  memory  bandwidth  of  the  hardware  solution  adopted.  Recent 
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numerical  experiments  conducted  by  the  authors  revealed  that  only  about  5 
to  10%  of  the  peak  flop  rate  is  reached  by  the  large  number  of  cores  present 
on  today’s  architectures  since  these  cores  most  of  the  time  idle  waiting  for 
data  from  global  memory  or  RAM.  It  is  this  observation  that  motivated  the 
selection  of  the  GPU  as  the  target  hardware  for  implementing  Chrono:: Fluid. 
At  roughly  300  GB/s,  the  GPU  memory  bandwidth  stands  four  to  five  times 
higher  than  what  one  could  expect  on  a  fast  CPU. 

To  describe  the  hardware  organization  of  the  GPU,  we  consider  here  the 
NVIDIA  GeForce  GTX  680  [18,  27].  This  GPU  is  based  on  the  first  genera¬ 
tion  of  Kepler  architecture,  code  name  GK104,  which  is  also  implemented  in 
Tesla  K10.  The  Kepler  architecture  relies  on  a  Graphics  Processing  Cluster 
(GPC)  as  the  defining  high-level  hardware  block.  There  are  a  total  of  4  GPCs 
on  the  GK104.  Each  GPC  includes  two  Stream  Multiprocessors  (SM),  each 
of  which  has  192  Scalar  Processors  (SP),  for  a  total  of  1536  SPs,  and  3.1 
TFlops  processing  rate. 

In  addition  to  the  processing  cores,  the  second  important  aspect  of  GPU 
hardware  is  that  of  the  memory  hierarchy.  The  memory  on  the  GPU  is  di¬ 
vided  into  several  types,  each  with  different  access  patterns,  latencies,  and 
bandwidths: 

Registers  (read/write  per  thread):  65536,  32-bit  memory  units.  Very  low 
latency,  high  bandwidth  (-  10  TB/s  cumulative)  memory  used  to  hold 
thread-local  data. 

Shared  memory/Ll  cache  (read/write  per  block):  64  KB.  Low  latency,  high 
bandwidth  (-  1.5-2  TB/s  cumulative)  memory  divided  between  shared 
memory  and  LI  cache. 

Global  memory  (read/write  per  grid):  2  GB.  Used  to  hold  input  and  output 
data.  Accessible  by  all  threads,  with  a  bandwidth  of  192  GB/s  and  a 
higher  latency  (-  400-800  cycles)  than  shared  memory  and  registers.  All 
accesses  to  global  memory  pass  through  the  L2  cache.  The  latter  is  512 
KB  large  and  has  a  bandwidth  of  512  B/clock  cycle. 

Constant  memory  (read  only  per  grid):  48  KB  per  Kepler  SM.  Used  to 
hold  constants,  serviced  at  the  latency  and  bandwidth  of  the  LI  cache 
upon  a  cache  hit,  or  those  of  the  global  memory  upon  a  cache  miss. 

The  parallel  execution  paradigm  best  supported  on  GPUs  is  “single  in¬ 
struction  multiple  data”  (SIMD).  In  this  model,  if  for  instance  two  arrays  of 
length  one  million  are  to  be  added,  one  million  threads  are  launched  with 
each  executing  the  same  instruction;  i.e.,  adding  two  numbers,  but  on  dif¬ 
ferent  data  -  each  thread  adding  two  different  numbers.  Lortunately,  SIMD 
computing  is  very  prevalent  in  the  solution  methodology  proposed,  where 
each  SPH  marker  is  handled  by  a  thread  in  the  same  way  in  which  hundreds 
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of  thousands  of  other  threads  handle  their  markers  using  different  data.  While 
strongly  leveraging  the  SIMD  model  owing  to  the  line  grain  parallelism  that  it 
exposes,  the  methodology  adopted  is  prone  to  lead  to  memory  access  patterns 
that  do  not  display  high  spatial  and/or  temporal  locality.  This  is  because  the 
SPH  markers  move  in  time  leading  to  less  structured  memory  accesses  that 
adversely  impact  the  effective  bandwidth  reached  by  the  code. 

3.2.  Time  integration 

Chrono::Fluid  uses  a  second  order  explicit  mid-point  Runge-Kutta  (RK2) 
scheme  [2]  for  the  time  integration  of  the  fluid  and  solid  phases,  the  latter 
in  its  rigid  or  flexible  representation. 

Algorithm  1  summarizes  the  steps  required  for  the  calculation  of  the  force 
on  the  SPH  markers,  rigid  bodies,  and  deformable  beams  at  time  step  k. 


Algorithm  1  Force  Calculation 

>  Calculate  modified  XSPH  velocities 
l:  for  a  :=  0  to  (Nm  -  1)  do 

2:  V*  =  \a(pk,Xk,\k) 

3:  end  for 


4 

5 

6 

7 

8 


>  SPH  forces 

for  a  :=  0  to  ( N„ ,  -  1)  do 

Pa  =pa(pky,yk) 

jf  _  ^ k 

\k  =  \a(pk,xk,\k) 

end  for 


>  Rigid  body  forces 
9:  for  i  :=  0  to  (Nr  -  1)  do 

10:  V*  =  V,(i*) 

11:  Xk  =  \k 

l  l 

12:  OJk  =  OJ,  (V, 

13:  qf  =  q,(mf,qf) 

14:  end  for 


>  ANCF  forces 
15:  for  j  :=  0  to  (Nf  -  l)  do 

16:  Qk  =  Q"  (v*)  -  Q}  (ek)  +  Qj  (eA) 

17:  k\  =  e^ 

18:  kk2  =  M-lQk 

19:  end  for 
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The  variables  Nm,  Nr,  and  Nf  denote  the  number  of  markers,  rigid  bodies, 
and  flexible  beams,  respectively.  The  arrays  p,  x,  v,  and  v  store  the  density, 
position,  velocity,  and  modified  velocity  for  all  markers,  respectively;  for 
example,  p  =  {pa\a  =  0, 1,2,  ...,Nm  -  1}. 

The  external  forces  on  the  rigid  and  flexible  bodies  include  the  FSI  forces 
captured  via  BCE  markers  at  distributed  locations  on  the  solid  surfaces.  The 
distributed  forces  need  to  be  accumulated  into  a  single  force  and  torque 
at  the  center  of  each  rigid  body,  or  point  forces  at  node  locations  of  each 
flexible  body.  The  reduction  (summation  of  the  forces  and  torques)  is  handled 
by  parallel  scan  operations  available  through  the  Thrust  library  (11],  which 
exposes  a  scan  algorithm  that  scales  linearly. 

The  RK2  integration  scheme  requires  the  calculation  of  the  force  at  the 
beginning  as  well  as  middle  of  the  time  step.  Algorithm  2  lists  the  steps 
required  for  the  time  integration  of  a  typical  FSI  problem. 

To  improve  the  code  vectorization  and  use  of  fast  memory;  i.e.,  L1/L2 
cache,  shared  memory,  and  registers,  each  computation  task  was  implement¬ 
ed  as  a  sequence  of  light  GPU  kernels.  For  instance,  different  computation 
kernels  are  implemented  to  update  the  attributes  of  the  solid  bodies,  in¬ 
cluding  force,  moment,  rotation,  translation,  linear  and  angular  velocity,  and 
locations  of  the  associated  BCE  markers.  A  similar  coding  philosophy  was 
maintained  for  the  density  re-initialization,  boundary  condition  implementa¬ 
tion,  and  mapping  of  the  marker  data  on  an  Eulerian  grid  for  post  processing. 

3.2.1.  Multi-rate  integration 

Stable  integration  of  the  SPH  fluid  equations  requires  step-sizes  which 
are  also  appropriate  for  propagating  the  dynamics  of  any  rigid  solids  in  the 
FSI  system.  However,  integration  of  the  dynamics  of  deformable  bodies, 
especially  as  their  stiffness  increases,  calls  for  very  small  time  steps.  To 
alleviate  the  associated  computational  cost,  we  use  a  dual-rate  integration 
scheme  using  intermediate  steps  for  the  integration  of  the  flexible  dynamics 
EOMs,  typically  with  AtSPH/AtANCF  =  10,  although  stiffer  problems  may 
require  ratios  of  up  to  50. 

This  aspect  is  noteworthy  given  that  typical  FSI  models  involve  a  number 
of  SPH  markers  orders  of  magnitude  larger  than  the  number  of  ANCF  nodal 
coordinates  required  for  the  flexible  bodies.  Without  a  multi-rate  implemen¬ 
tation,  the  numerical  solution  would  visit  the  fluid  phase  at  each  integration 
step,  an  approach  that  led  to  very  large  solution  times.  This  aspect  is  further 
discussed  in  Sect.  4. 
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Algorithm  2  RK2  Time  Integration 

>  Force  calculation  at  beginning  of  step  ( Algorithm  1) 
l:  Calculate  {\k,  pk,  xk,  \k,  \k,  Xk,  iok,  qk,  QA} 

>  Half- step  updates  for  fluid,  rigid  body,  and  flexible  body  states 
2:  for  a  €  { a\a  is  a  fluid  marker}  do 

3:  fka+m  =  Iff*  +  Iff  a  x  At/2,  where  e  {pa,  xa,  v0} 

4:  end  for 

5:  for  i  :=  0  to  (Nr  -  1)  do 

6:  x¥-+1'2  =  Tf  +  X  At/2,  where  T,  €  {V„  X„  op,  q,| 

7:  end  for 

8:  for  j  :=  0  to  (Nf  -  l)  do 

9:  e{+1/2  =  eA  +  x  At/2 

10:  e2+1/2  =  ek  At/2 

it:  end  for 

>  Half-step  update  for  BCE  marker  positions  and  velocities 
12:  for  a  €  { a\a  is  a  BCE  marker}  do 

13:  Obtain  xA+1/2  and  vA+l/2  according  to  associated  rigid,  flexible,  or  im¬ 

mersed  boundary  motion. 

14:  end  for 

>  Force  calculation  at  half-step  (Algorithm  1 ) 

15:  Calculate  {\k+l/2,  pk+l/2,  kk+l/2,  kA+1/2,  VA+1/2,  XA+1/2,  d/+1/2,  qA+1/2, 

Q^+!/2} 

>  Full-step  updates  for  fluid,  rigid  body,  and  flexible  body  states 
16:  for  a  €  {a\a  is  a  fluid  marker}  do 

17:  flk+l  =flk  +  <A a+m  X  At 

18:  end  for 

19:  for  i  :=  0  to  (Nr  -  1)  do 
20:  Trt+1  =  *¥k  +  '1/A  +  1/2  X  At 

l  l  l 

21:  end  for 

22:  for  j  :=  0  to  (Nf  -  l)  do 
23:  eA+1  =  eA  +  eA+1/2  x  At 
24:  eA+1  -  ek  +  eA+1/2  x  At 

25:  end  for 

>  Full-step  update  for  BCE  marker  positions  and  velocities 
26:  for  a  €  { a\a  is  a  BCE  marker}  do 

27:  Obtain  xA+1  and  vA+1. 

28:  end  for 
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3.3.  Proximity  calculation  and  neighbor  search 

In  the  current  implementation,  each  marker  has  a  list  of  neighbor  mark¬ 
ers;  i.e.,  markers  that  fall  within  its  support  domain.  Given  this  set  of  lists,  the 
calculation  of  v*  =  \a  (pk,xk,  v*),  pk  =  pa  [pk,  rk,  v*),  and  i?k  =  vQ  [pk,  rk,  v*) 
is  straightforward.  The  loop  iterations  in  Algorithms  1  and  2  have  no  overlap 
and  can  be  executed  in  parallel.  The  computational  bottleneck  thus  becomes 
the  determination  of  the  neighbor  lists  through  proximity  calculation,  a  step 
that  requires  about  70%  of  the  entire  computational  budget  and  thus  critically 
impacts  the  overall  performance  of  the  simulation.  Herein,  we  adopt  a  spatial 
binning  algorithm  which  is  sub-optimal  in  terms  of  amount  of  net  work  but 
superior  in  terms  of  memory  usage.  In  this  approach,  the  computation  domain 
is  divided  into  a  collection  of  bins.  The  bins  have  side  lengths  equal  to  the 
maximum  influence  distance  of  a  marker,  i.e.  kIi.  This  localizes  the  search 
for  the  possible  interacting  markers  to  the  bin  and  all  of  its  26  immediate 
3D  neighbors.  Figure  3  shows  the  binning  approach  in  2D. 


Fig.  3.  Illustration  of  the  spatial  subdivision  method  used  for  proximity  computation  in  2D.  The 
circles  represent  the  domain  of  influence  of  each  marker;  i.e.,  the  support  domain.  For  clarity,  a 
coarse  distribution  of  markers  is  shown.  In  reality,  the  concentration  of  markers  per  bin  is  much 

larger 


In  the  neighbor  search  approach  adopted  herein,  neighbor  lists  are  not 
saved  in  memory;  instead,  neighbors  are  evaluated  whenever  required.  Al¬ 
ternatively  [28],  the  principle  of  action  and  reaction  could  be  leveraged  to 
calculate  and  store  the  acceleration  terms  for  each  inter-penetration.  Although 
the  second  algorithm  reduces  the  amount  of  work  by  re-using  the  accelera¬ 
tion  terms  calculated  for  each  inter-penetration,  it  can  not  be  applied  to  the 
SPH  method  due  to  the  massive  amount  of  memory  required  to  store  the 
data  associated  with  all  inter-penetration  events.  Another  advantage  of  the 
adopted  neighbor  search  approach  is  the  coalesced  memory  access  achieved 
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by  sorting  and  accessing  the  data  based  on  the  markers  location.  The  steps 
required  for  accessing  the  neighbors  are  summarized  in  Algorithm  3. 


Algorithm  3  Inner  loop:  accessing  neighbor  markers 

l:  Divide  the  solution  space  into  n b  bins  of  (Av,  Av,  AZJ  dimensions,  where 
nb  =  nx x  ny  X  nz,  and  (nx,  ny,  n.)  is  the  number  of  grid  cells  along  (x,y,z) 
axis. 

2:  Construct  the  hash  array:  =  {sa\a  =  0, 1,2,  -  1}  according  to  sa  = 

=  ixriyXn-  +jxn-  +  k,  where  (i,  j,  k)  is  the  location  of  the  bin  containing 
the  marker  a. 

3:  Sort  into  sorted  and  obtain  corresponding  psorted,  xsorted,  y sorted,  and 

V  sorted’ 

4:  Construct  c1  =  {cxp\p  =  0, 1,2, ...,  nb- 1}  and  c2  =  [cp\p  =  0, 1, 2,  ...,nb- 1}, 
where  cp  and  cp  denote  the  two  indices  in  sorted  that  bound  the  sequence 
of  hash  values  sa  =  p. 

5:  Access  markers  data  in  bin  ( i,j,k )  by  loading  \cp,  c2|  portions  of  the 
sorted  arrays  psorted,  xsorted,  v sorted ,  and  y  sorted,  as  needed. 


The  data  sorting  in  step  3  of  Algorithm  3  is  performed  using  the  linear- 
complexity  radix  sort  available  in  the  Thrust  library.  As  a  result,  this  solution 
component  does  not  affect  the  overall  linear  scaling  of  Algorithms  1  and  2. 
Working  with  the  sorted  arrays  for  the  bulk  of  the  computation  has  the 
additional  advantage  of  increasing  the  memory  spatial  locality:  SPH  markers 
that  share  a  neighborhood  in  the  physical  space,  do  so  in  their  memory 
location  as  well. 


4.  Results 

The  simulation  approach  described  in  Sect.  3  was  implemented  to  execute 
in  parallel  on  GPU  cards  using  the  CUDA  programming  environment  [261. 
All  simulations  reported  in  this  section  were  run  on  an  NVIDIA  GeForce 
GTX  680  GPU  card  described  in  sub-section  3.1.  In  all  simulations,  if 
present,  the  flexible  beams  were  modeled  using  ne  =  4  ANCF  beam  ele¬ 
ments,  while  the  integrals  appearing  in  the  elastic  forces  Q'  in  Eq.  (25)  were 
evaluated  using  5  and  3  Gauss  quadrature  points  for  the  axial  and  bending 
elastic  forces,  respectively. 

A  snapshot  from  a  typical  FSI  problem  involving  flexible  bodies  is  shown 
in  Figure  4.  Additional  Chrono::Fluid  simulation  examples  can  be  found 
at  [32].  Comprehensive  validation  studies  of  the  Chrono::Fluid  simulation 
package  are  provided  in  [31].  The  focus  here  is  on  numerical  experiments 
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that  illustrate  the  performance  and  scalability  of  the  proposed  algorithm  and 
its  parallel  implementation. 


Fig.  4.  Example  of  Chrono::Fluid  simulation:  channel  flow  over  an  array  of  flexible  cantilevers. 

(See  [32]  for  further  details) 


We  begin  by  analyzing  the  efficiency  and  scaling  attributes  of  the  solution 
for  systems  composed  exclusively  of  rigid  bodies,  flexible  bodies,  or  a  fluid 
phase.  In  each  of  these  three  scenarios,  we  provide  the  simulation  times  per 
time  step  for  problems  of  increasing  size.  Data  provided  in  Tables  1,  2,  and 
3  and  illustrated  in  Figure  5  indicate  that  updates  of  the  dynamics  of  each 
phase  scale  linearly  with  problem  size;  i.e.,  with  the  number  of  rigid  bodies, 
flexible  bodies,  and  SPH  markers,  respectively. 


Table  1. 

Time  required  for  advancing  the  rigid  body  dynamics  simulation  by  one  time  step 
as  function  of  problem  size  (number  of  rigid  bodies,  Nr ) 


Nm  =  0,Nf  =  0 


Nr 

(xlO3) 

0.49 

2.87 

16.59 

56.77 

118.23 

t 

(ms) 

5 

8 

16 

44 

78 

While  the  previous  results  show  linear  scaling  for  rigid  and  flexible  body 
dynamics,  in  actual  FSI  problems  in  which  rigid  and/or  flexible  bodies  inter¬ 
act  with  the  fluid,  the  simulation  time  is  virtually  independent  of  the  number 
of  solid  objects.  The  results  presented  in  Tables  4  and  5  were  obtained  on 
a  system  consisting  of  approximately  3  million  SPH  markers  by  varying 
the  number  of  rigid  bodies  and  flexible  bodies.  The  small  sensitivity  of  the 
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Table  2. 

Time  required  for  advancing  the  flexible  body  dynamics  simulation  by  one  time  step 
as  function  of  problem  size  (number  of  flexible  bodies,  Nj) 


Nm  =  0,  Nr  =  0 


Nf 

(xlO3) 

0.78 

3.51 

17.55 

56.94 

115.05 

t 

(ms) 

8 

14 

48 

122 

238 

simulation  time  with  respect  to  the  number  of  solid  objects  is  due  to  the  fact 
that  the  number  of  BCE  markers  associated  with  solid  bodies  represents  only 
a  very  small  fraction  of  the  number  of  SPH  discretization  markers,  the  latter 
overwhelmingly  dictating  the  required  computation  time.  Nevertheless,  as  the 
concentration  of  solid  objects  increases,  smaller  time  steps  are  required  since 
the  probability  of  short-range,  high-frequency  interactions  increases. 

Table  3. 


Time  required  for  advancing  a  fluid  dynamics  simulation  by  one  time  step 
as  function  of  problem  size  (number  of  SPH  markers,  Nm) 


K 

=  0,  Nf  = 

0 

Nm  (xlO6) 

0.06 

0.32 

0.93 

1.79 

4.13 

t  (ms) 

27 

121 

331 

538 

1150 

Simulation  time  per  time  step  for  an  FSI  problem  with  fixed  number 
of  SPH  markers  and  increasing  number  of  rigid  bodies 


Nm  -  3.0  x  106,  Nf  =  0 


Nr 

0 

36 

120 

480 

1800 

8400 

33600 

t  (ms) 

906 

919 

923 

925 

926 

926 

921 

Table  4. 


The  two  sets  of  results  provided  in  Table  5  for  different  values  of  r  = 
=  Afj  ph  /  At  an  c  f  show  a  small  increase  in  the  simulation  time  when  choosing 
a  very  small  integration  step  size  for  flexible  body  dynamics.  This  illustrates 
the  efficiency  of  the  multi-rate  integration  scheme  in  improving  the  time 
integration  stability  at  a  small  increase  in  computational  cost.  The  small 
changes  in  the  simulation  times  provided  in  Tables  4  and  5  are  mainly  due  to 
the  deviations  in  the  magnitude  of  Nm  as  the  number  of  solid  objects  changes. 
Linear  scaling  in  Chrono::Fluid  is  also  demonstrated  in  an  experiment  where 
a  combined  FSI  problem,  i.e.  involving  rigid  and  flexible  bodies  and  including 
a  lubrication  force  model  and  two-way  coupling  with  fluid,  is  solved  on 
domains  of  increasing  size  (see  Figure  6).  As  the  simulation  domain  volume 
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Fig.  5.  Scaling  analysis  of  Chrono::Fluid  for  rigid  body  dynamics,  flexible  body  dynamics,  and 
fluid  dynamics  as  function  of  problem  size  (Nr,  Nf ,  and  Nm,  respectively).  The  coefficients  R2  are 

specified  to  each  linear  regression 

is  increased  by  factors  from  2  up  to  32,  the  number  of  SPH  markers  varies 
from  about  76,000  to  more  than  2.5  million.  Simultaneously,  the  number 
of  rigid  and  flexible  bodies  grow  from  168  to  more  than  24,000  and  from 
160  to  almost  10,000,  respectively  (see  Table  6).  As  shown  in  Figure  6, 
Chrono::Fluid  achieves  linear  scaling  over  the  entire  range  of  problem  sizes. 

Table  5. 

Simulation  time  per  time  step  for  an  FSI  problem  with  fixed  number  of  SPH  markers  and 
increasing  number  of  flexible  bodies,  for  two  different  values  of  the  multi-rate  integration  factor 

r  =  AtSpH/AtANCF 


Nm  -  3.0  x  106,  Nr  =  0 


Nf 

0 

45 

140 

440 

1152 

2100 

4704 

r  =  10 

t  (ms) 

906 

923 

928 

916 

960 

950 

921 

r  =  50 

t  (ms) 

906 

973 

978 

965 

1066 

1060 

1060 
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Table  6. 

Simulation  time  per  time-step  for  combined  FSI  problems  of  increasing  size 


N„, 

(xlO6) 

0.08 

0.16 

0.29 

0.63 

0.95 

1.54 

2.50 

Nr 

(xlO3) 

0.17 

0.52 

1.12 

4.48 

7.84 

14.56 

24.64 

Nf 

(xlO3) 

0.16 

0.42 

0.84 

2.10 

3.36 

5.88 

9.66 

t 

(ms) 

45 

74 

120 

230 

343 

522 

820 

Fig.  6.  Scaling  analysis  of  Chrono::Fluid  for  FSI  problems:  simulation  time  as  function  of 
combined  problem  size.  In  this  experiment,  the  volume  of  the  simulation  domain  is  increased,  up 
to  32  times  the  volume  of  the  initial  domain,  leading  to  proportional  increases  in  the  number  of 
SPH  markers  and  of  solid  objects  (both  rigid  and  flexible  bodies)  as  shown  in  the  bottom  plot 
(see  also  Table  6).  As  illustrated  by  the  top  graph,  the  simulation  time  per  time  step  varies 
linearly  with  problem  size.  The  coefficients  R2  are  specified  for  each  linear  regression 

5.  Conclusions 

This  contribution  discusses  details  of  an  HPC  approach  to  the  simula¬ 
tion  of  FSI  problems.  Relying  on:  (i)  a  Lagrangian/Lagrangian  formalism 
for  the  simulation  of  the  fluid  and  solid  phases,  and  (ii)  GPU  parallelism, 
Chrono::Fluid,  the  software  implementation  of  the  proposed  solution  method- 
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ology,  is  suitable  for  the  study  of  multi-phase/multi-component  problems 
such  as  particle  suspensions,  polymer  flow,  and  biomedical  applications. 

Chrono::Fluid  has  been  shown  to  scale  linearly  in  the  dynamics  of  each 
phase  independently  as  well  as  in  the  overall  FSI  solution  implementation. 
The  dynamics  of  the  fluid  phase;  i.e.  the  time  propagation  of  the  SPH  mark¬ 
ers,  controls  the  overall  simulation  time,  and  neither  cross-phase  communi¬ 
cations  nor  solid  phase  simulation  have  a  notable  effect  on  the  simulation 
time.  This  represents  one  advantage  of  the  adopted  unified  FSI  methodology 
over  co-simulation  and  asynchronous  approaches.  Chrono::Fluid,  whose  pre¬ 
liminary  validation  is  discussed  in  [30],  is  open  source  and  freely  available 
under  a  BSD3  license. 

Several  directions  of  future  work  are  as  identified  as  follows:  (a)  revis¬ 
it  the  SPH  marker  numbering  scheme  in  order  to  improve  the  spatial  and 
temporal  memory  access  patterns,  thus  increasing  the  effective  bandwidth  of 
Chrono:: Fluid;  ( b )  augment  the  current  implementation  for  use  on  cluster  su¬ 
percomputers  that  adopt  a  non-uniform  memory  access  model  and  rely  on  the 
Message  Passing  Interface  standard;  (c)  investigate  problems  at  macroscale 
such  as  vehicle  fording  operations,  which  call  for  large  scale  simulations  at 
low  Reynolds  numbers  and  possibly  in  the  presence  of  very  large  viscosity; 
and  ( d )  gauge  the  potential  of  Chrono:  :Fluid  in  biomechanics  applications 
such  as  blood  flow  in  deformable  arteries  or  heart,  channel  occlusion  in 
stroke,  etc. 
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Zastosowanie  wysokowydajnej  techniki  obliczeniowej  (HPC)  do  symulacji  problemow 
interakcji  migdzy  plynem  i  ciatem  stalym  z  elementami  sztywnymi  i  elastycznymi 

Streszczenie 

W  pracy  przedstawiono  zarys  jednolitego  podejscia  do  bezposredniej  numerycznej  symulacji 
problemow  interakcji  plyn  -  cialo  state  (FSI)  z  wykorzystaniem  wielow^tkowej  wysokowydaj¬ 
nej  techniki  obliczeniowej  (HPC)  o  wielkiej  skali.  Algorytm  symulacji  opiera  si§  na  rozsze- 
rzonej  metodzie  hydrodynamiki  czqstek  gladkich  (XSPH),  ktora  opisuje  przeplyw  plynu  w  for- 
malizmie  Lagrange"  a  zgodnym  z  metod^  Lagrange’ a  sledzenia  fazy  stalej.  W  celu  modelowania 
sztywnego  i  elastycznego  ukladu  wielu  cial  implementowano  ogoln^,  trojwymiarowt)  dynamik^ 
ciala  sztywnego  i  zastosowano  sformulowanie  bezwzgl^dnych  wspolrz^dnych  w^zlowych  (ANCF). 
Dwukierunkowe  sprz^zenie  mi^dzy  plynem  i  fazq  stalq  jest  zamodelowane  przez  uzycie  znacznikow 
wymuszenia  warunkow  brzegowych  (BCE)  ktore  oddajq  dzialanie  sil  sprz^zenia  mi^dzy  plynem 
a  cialem  stalym  wymuszajqc  brak  poslizgu  w  warunkach  brzegowych.  Problem  interakcji  bliskiego 
zakresu  mi^dzy  plynem  i  cialem  stalym,  ktora  ma  decyduj^cy  wplyw  na  zachowanie  w  malej  skali 
mieszanin  plynow  i  cial  stalych,  rozwi^zano  przy  pomocy  modelu  sil  smarowania.  Stany  systemu 
zbiorczego  sq  integrowane  w  czasie  przy  uzyciu  jawnego,  wieloszybkosciowego  schematu.  By 
zmniejszyc  wielkie  obci^zenie  obliczeniowe,  w  algorytmie  ogolnym  polozono  nacisk  na  obliczenia 
rownolegle  w  kartach  procesorow  graficznych  (GPU).  W  pracy  przedstawiono  analiz^  wydajnos- 
ci  i  skalowania  dla  scenariuszy  symulacji  obejmuj^cych  jednt}  lub  wiele  faz  przy  liczbie  obiek- 
tow  stalych  si^gaj^cej  dziesi^tek  tysi^cy.  Implementacja  oprogramowania  przedstawionej  metody, 
o  nazwie  Chrono::Fluid,  jest  cz^sci^  projektu  Chrono  i  jest  udost^pniona  do  uzytku  nieodplatnego. 
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